Our investment bank client is looking for a Senior SRE to join their infrastructure team, with a focus on monitoring, Kubernetes reliability, and observability to ensure resilient, scalable, high-performing platforms.
Key Responsibilities
- Lead reliability and observability across platforms, ensuring high availability and performance
- Design, implement, and enhance monitoring solutions using tools such as Prometheus, Grafana, and Elasticsearch
- Develop alerting strategies, dashboards, and end-to-end observability pipelines
- Diagnose complex production incidents through log analysis, troubleshooting, and root cause investigation
- Manage and optimize Kubernetes environments, including health checks, scaling, and workload stability
- Administer Linux systems (RHEL), covering upgrades, patching, and performance tuning
- Collaborate with engineering, infrastructure, and application teams to strengthen system resilience and scalability
- Maintain logging pipelines, including ingestion, parsing, and routing into search/analytics platforms
- Continuously evaluate and adopt modern SRE tools, practices, and automation approaches
- Participate in on-call rotations for production support, including off-hours coverage
Key Requirements
- Degree in Computer Science, Engineering, or a related field
- 8–10 years’ experience in SRE, platform engineering, or production support environments
- Strong hands-on expertise in monitoring and observability tools (e.g., Prometheus, Grafana, Elasticsearch, Kibana)
- Proven experience building metrics pipelines, exporters, and integrations with long-term storage systems
- Solid experience with automation and scripting (Python, Bash, Ansible, CI/CD pipelines)
- Experience managing log processing pipelines (e.g., ingestion, filtering, enrichment)
- Proficient in designing dashboards and analytics for distributed systems
- Strong Linux administration knowledge, including troubleshooting and system optimization
- Hands-on Kubernetes experience (operations, orchestration, scaling, and troubleshooting)
- Understanding of SRE principles, incident management, high availability, and disaster recovery
- Knowledge of networking concepts and distributed system performance tuning
- Exposure to GPU-based or AI/ML infrastructure is advantageous
- Self-driven, adaptable, and capable of handling multiple priorities in a fast-paced environment
- Fluent in English; Cantonese and Mandarin language skills are a plus
“Sanderson-iKas” is the brand name for the following companies incorporated in Hong Kong: Sanderson Solutions International (Hong Kong) Limited (Business Registration no. 53741924) and iKas International (Asia) Limited (Business Registration no. 39818987).
Website: www.sanderson-ikas.hk